
Feat/ci tests v2 framework cpu hog #1180

Open
ddjain wants to merge 20 commits into krkn-chaos:main from ddjain:feat/ci-tests-v2-framework-cpu-hog

Conversation

@ddjain
Collaborator

@ddjain ddjain commented Mar 6, 2026

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

Scenario

  • CPU hog (hog_scenarios): runs a CPU stress workload on selected nodes for a set duration, then removes the hog pods.

Test cases

  1. Success and lifecycle – Scenario runs, at least one hog pod appears during the run, process exits 0, no hog pods left after run.
  2. Node selector and duration – Scenario runs with node-selector=kubernetes.io/os=linux and duration=10; exit 0 and run time in expected range (~8–90s).
  3. Invalid node selector – Node selector matches no nodes; Krkn exits with failure (non-zero).
  4. Invalid scenario YAML – Invalid scenario file; Krkn exits with failure (non-zero).
  Tests use ephemeral namespaces (no pre-deployed workload).
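The lifecycle checks above hinge on polling the ephemeral namespace for hog pods. A minimal sketch of such a poller, written against a generic zero-argument lister callable rather than the PR's actual `_wait_for_hog_pod` helper (the function name and injection style here are illustrative assumptions):

```python
import time

def wait_for_pods_with_prefix(list_pod_names, prefix, timeout=30.0, interval=1.0):
    """Poll until at least one pod name starts with `prefix`.

    list_pod_names: zero-arg callable returning an iterable of pod names,
    e.g. a thin wrapper around CoreV1Api.list_namespaced_pod(namespace).
    Returns the matching names, or raises TimeoutError at the deadline.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        matches = [name for name in list_pod_names() if name.startswith(prefix)]
        if matches:
            return matches
        time.sleep(interval)
    raise TimeoutError(f"no pod with prefix {prefix!r} within {timeout}s")
```

Injecting the lister keeps the sketch testable without a cluster; the real test would bind it to the suite's Kubernetes client and namespace.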

Related Tickets & Documents

If there is no related issue, please create one and start the conversation there.

  • Related Issue #:
  • Closes #:

ddjain and others added 20 commits February 24, 2026 16:07
…isolation

Signed-off-by: ddjain <darjain@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@ddjain ddjain marked this pull request as ready for review March 9, 2026 10:49
@qodo-code-review

Review Summary by Qodo

Add CPU hog scenario functional tests to v2 framework

✨ Enhancement 🧪 Tests


Walkthroughs

Description
• Add CPU hog scenario functional tests with ephemeral namespace isolation
• Implement test cases for success lifecycle, node selector, duration validation
• Add failure scenarios for invalid node selector and malformed YAML
• Create base scenario configuration for CPU hog stress testing
Diagram
flowchart LR
  A["CPU Hog Test Suite"] --> B["Success & Lifecycle Test"]
  A --> C["Node Selector & Duration Test"]
  A --> D["Invalid Node Selector Test"]
  A --> E["Invalid Scenario YAML Test"]
  B --> F["Verify Pod Creation & Cleanup"]
  C --> G["Validate Timing & Exit Code"]
  D --> H["Assert Kraken Failure"]
  E --> H
  A --> I["Base Scenario Config"]


File Changes

1. CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py 🧪 Tests +121/-0

CPU hog scenario functional test suite implementation

• Implement four functional test cases for CPU hog scenario execution
• Test pod lifecycle: creation during run and cleanup after completion
• Validate node selector targeting and duration constraints
• Verify graceful failure handling for invalid configurations
• Add helper functions to poll and retrieve hog pods from namespace

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py


2. CI/tests_v2/pytest.ini ⚙️ Configuration changes +1/-0

Add cpu_hog pytest marker configuration

• Add cpu_hog pytest marker for CPU hog scenario test identification
• Enable selective test execution and filtering by scenario type

CI/tests_v2/pytest.ini


3. CI/tests_v2/scenarios/cpu_hog/scenario_base.yaml ⚙️ Configuration changes +12/-0

Base CPU hog scenario configuration template

• Define base CPU hog scenario configuration with default parameters
• Set duration, worker count, CPU load percentage, and image reference
• Configure namespace and node selector targeting for test execution
• Specify hog type as CPU with all CPU methods and single node target

CI/tests_v2/scenarios/cpu_hog/scenario_base.yaml
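For orientation, an illustrative shape of such a base config (field names and the image reference are assumptions inferred from the bullet points above, not copied from the PR's actual scenario_base.yaml):

```yaml
# Illustrative only -- field names inferred from the description above,
# not the PR's actual scenario_base.yaml.
duration: 10                  # seconds of CPU stress
workers: 1                    # stressor workers per hog pod
cpu-load-percentage: 80       # target CPU load
cpu-method: all               # exercise all CPU stress methods
hog-type: cpu
image: quay.io/example/hog    # hypothetical image reference
namespace: ""                 # filled per-test with the ephemeral namespace
node-selector: "kubernetes.io/os=linux"
number-of-nodes: 1            # single node target
```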



@qodo-code-review

qodo-code-review bot commented Mar 9, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. Unhandled kraken timeout 🐞 Bug ⛯ Reliability
Description
test_cpu_hog_success_and_lifecycle calls proc.communicate(timeout=90) without handling subprocess
timeout, so a slow/hung kraken run will raise and leave the background process running (with
stdout/stderr pipes) and can hang the overall test session.
Code

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[R56-64]

+        proc = self.run_kraken_background(config_path)
+        try:
+            pods = _wait_for_hog_pod(
+                self.k8s_core, ns, self.HOG_POD_PREFIX, timeout=POLICY_WAIT_TIMEOUT
+            )
+            assert len(pods) >= 1, f"Expected at least one hog pod in namespace={ns}"
+        finally:
+            # duration=10 + pod wait (30s) + cleanup; allow 90s for Krkn to exit.
+            stdout, stderr = proc.communicate(timeout=90)
Evidence
The test starts kraken using a Popen with stdout/stderr pipes and then unconditionally calls
communicate(timeout=90) in a finally block without any termination/kill fallback, so a
TimeoutExpired will abort cleanup and can leave the subprocess alive. The suite already defines
overridable timeout constants (including KRAKEN_PROC_WAIT_TIMEOUT), but this test hard-codes 90
seconds, making behavior inconsistent when env overrides are used.

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[45-65]
CI/tests_v2/lib/kraken.py[44-58]
CI/tests_v2/lib/base.py[38-46]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`test_cpu_hog_success_and_lifecycle` starts Kraken in the background and calls `proc.communicate(timeout=90)` without handling `subprocess.TimeoutExpired`. If Kraken hangs or runs longer than expected, the test errors out and can leave the subprocess running (with stdout/stderr pipes), which may stall the test session and leak cluster resources.

### Issue Context
- `run_kraken_background` uses `stdout=PIPE` and `stderr=PIPE`.
- The test suite already defines configurable timeout constants (`KRAKEN_PROC_WAIT_TIMEOUT`, `TIMEOUT_BUDGET`, etc.) via env vars.

### Fix Focus Areas
- CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[56-69]
- CI/tests_v2/lib/base.py[38-46]

### Suggested implementation notes
- Wrap `proc.communicate(...)` in `try/except subprocess.TimeoutExpired`.
- On timeout: `proc.terminate()` then `proc.kill()` if still running; drain output; then fail with a clear message.
- Replace the hard-coded `90` with `KRAKEN_PROC_WAIT_TIMEOUT` or a computed timeout derived from `scenario['duration']` plus a buffer.




Remediation recommended

2. Logs dropped on pod-wait 🐞 Bug ✧ Quality
Description
If no hog pod appears and _wait_for_hog_pod raises TimeoutError, the test never reaches the code
that wraps stdout/stderr and calls assert_kraken_success, so the kraken output captured in the
finally block is not persisted or shown, making CI failures hard to debug.
Code

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[R56-70]

+        proc = self.run_kraken_background(config_path)
+        try:
+            pods = _wait_for_hog_pod(
+                self.k8s_core, ns, self.HOG_POD_PREFIX, timeout=POLICY_WAIT_TIMEOUT
+            )
+            assert len(pods) >= 1, f"Expected at least one hog pod in namespace={ns}"
+        finally:
+            # duration=10 + pod wait (30s) + cleanup; allow 90s for Krkn to exit.
+            stdout, stderr = proc.communicate(timeout=90)
+        result = SimpleNamespace(
+            returncode=proc.returncode,
+            stdout=stdout or "",
+            stderr=stderr or "",
+        )
+        assert_kraken_success(result, context=f"namespace={ns}", tmp_path=self.tmp_path)
Evidence
The test waits for hog pods inside a try and then builds the result object and calls
assert_kraken_success only after the try/finally. Any exception from _wait_for_hog_pod exits the
test before the result/assertion path, and unlike assert_kraken_success (which writes logs to
tmp_path on failure), there is no log persistence in this exception path.

CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[56-70]
CI/tests_v2/lib/utils.py[166-188]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
When `_wait_for_hog_pod(...)` times out, the test raises `TimeoutError` before it creates the `result` object and calls `assert_kraken_success`. Although `stdout/stderr` are obtained in the `finally` block, they are not persisted or surfaced on this exception path.

### Issue Context
`assert_kraken_success`/`assert_kraken_failure` already have a convention of writing `kraken_stdout.log` / `kraken_stderr.log` to `tmp_path`, but this path bypasses those helpers.

### Fix Focus Areas
- CI/tests_v2/scenarios/cpu_hog/test_cpu_hog.py[56-75]
- CI/tests_v2/lib/utils.py[166-188]

### Suggested implementation notes
- Capture exceptions from the pod-wait block (e.g., `except Exception as e: exc = e`) and in `finally` write `stdout/stderr` to `tmp_path` before re-raising/failing.
- Alternatively, convert the timeout into an `AssertionError` that includes the last N lines of kraken stdout/stderr and points to the log files.




